CRISPR adaptive immune systems were analyzed for all available completed genomes of archaea, which included 
representatives of each of the main archaeal phyla. Initially, all proteins encoded within, and proximal to, CRISPR-cas loci 
were clustered and analyzed using a profile-profile approach. Then cas genes were assigned to gene cassettes and to 
functional modules for adaptation and interference. CRISPR systems were then classified primarily on the basis of their 
concatenated Cas protein sequences and gene synteny of the interference modules. With few exceptions, they could 
be assigned to the universal Type I or Type III systems. For Type I, subtypes I-A, I-B, and I-D dominate but the data support 
the division of subtype I-B into two subtypes, designated I-B and I-G. About 70% of the Type III systems fall into the 
universal subtypes III-A and III-B but the remainder, some of which are phyla-specific, diverge significantly in Cas protein 
sequences, and/or gene synteny, and they are classified separately. Furthermore, a few CRISPR systems that could not 
be assigned to Type I or Type III are categorized as variant systems. Criteria are presented for assigning newly sequenced 
archaeal CRISPR systems to the different subtypes. Several accessory proteins were identified that show a specific gene 
linkage, especially to Type III interference modules, and these may be cofunctional with the CRISPR systems. Evidence is 
presented for extensive exchange having occurred between adaptation and interference modules of different archaeal 
CRISPR systems, indicating the wide compatibility of the functionally diverse interference complexes with the relatively 
conserved adaptation modules. 



Introduction 

CRISPR adaptive immune systems are present in most archaea 
and in many bacteria where they primarily target and degrade 
invading genetic elements. The immune reaction involves three 
primary stages. First, adaptation involves selection of 30-45 bp 
DNA fragments (protospacers) from an invading genetic ele- 
ment and their insertion between repeats of genomic CRISPR 
arrays as de novo spacers. Second, transcripts of CRISPR arrays 
are processed generally within the repeat sequences to yield small 
CRISPR RNAs (crRNAs). Third, crRNAs are assembled into 
protein interference complexes and guide the complex to match- 
ing sequences on nucleic acid(s) of the invading genetic element 
which are then cleaved. 1 

Early classifications of CRISPR adaptive immune systems 
were based primarily on sequence analyses of CRISPR repeats 
or CRISPR-associated (Cas) proteins and yielded several distinct 
groupings. 2-4 There is now a consensus that CRISPR systems can 
be classified structurally into major classes denoted Types I, II, 
and III for bacteria and Types I and III for archaea. 5 The initial 
adaptation step is relatively conserved mechanistically among 
the three main CRISPR types and requires proteins Cas1, Cas2 
and, in many systems, Cas4. Primary processing of CRISPR 
transcripts is accomplished with Cas6 in Type I and Type III 
systems and by a combination of RNase III and a tracrRNA in 
the bacterial Type II system. 1 The interference modules are much 
more varied with respect to both the number and sequences of 
the protein components involved, and the nucleic acid targets. 
Moreover, the high sequence diversity of interference 
components, especially for the proteins labeled Cas7 and Cas8, provided 
a major obstacle to reaching a consensus about their 
classification. Despite this diversity, however, the three-dimensional 
structures of the different interference complexes (denoted Cascade) 
appear to be partly conserved. 6-8 Nevertheless, the mechanisms 
of interference are likely to be diverse, with Type I and Type II 
systems appearing to target primarily DNA while Type III 
interference systems can target DNA or RNA. 9-13 

Type I and Type III systems have been further classified into 
subtypes based primarily on sequences of signature Cas proteins 
Cas1, Cas3, and Cas8 for Type I, and Cas10 and the small 
protein component S (protein S) for Type III systems, and on their 
gene synteny. 14 However, there remain some limitations in the 
methods employed for identifying CRISPR subtypes. For 
example, categorizing systems according to the sequence of the larger 
conserved Cas1 protein is not always unambiguous and, 
moreover, Cas8, Cas10, and protein S are unsatisfactory signature 
proteins because of their variable size and high sequence diversity. 

Type III systems are found in most archaeal phyla, and 
especially among extreme thermophiles, and their interference 
complexes are commonly encoded in gene cassettes that are not 
linked genomically to either CRISPR loci or adaptation cas gene 
cassettes. 15 Earlier, we analyzed archaeal CRISPR-Cmr/Csm 
systems, based on Cas10 (Cmr2/Csm1) sequences, and concluded 
that there were five main families A, B, C, D, and E, of which 
the smallest and least well defined was family C. 15,16 
Subsequently, these were defined as Type III systems and were 
separated into the major subtypes III-A and III-B for bacteria 
and archaea, 5 where subtypes III-A and III-B corresponded to 
the archaeal families E and B, respectively. Although the 
potential existence of families A and D was acknowledged by 
others earlier, 4 they were omitted from the subsequent general 
classification. 14 

Figure 1. 16S rRNA phylogenetic tree for all archaeal genomes included in the study. The tree shows the primary archaeal phyla with the total number 
of genomes analyzed in each phylum indicated (single genomes for a given phylum are not numbered). The putative kingdoms are color-coded on 
the bar on the far left. Crenarchaeota, green; Aigarchaeota, light green; Korarchaeota, yellow; Thaumarchaeota, light blue; Euryarchaeota, dark blue. 
Environmental habitats are color-coded on vertical lines to the right of the kingdom bar: hydrothermal, orange; marine, blue; wetland sediment, brown; 
and hypersaline lake, pink. For each phylum, the Type I and Type III CRISPR subtypes are color-coded on the right and the total numbers of each subtype 
are given. The branch length ruler corresponds to a 5% difference in 16S rRNA sequence. 

Previously, there have been reports of other proteins being 
encoded within or adjacent to archaeal CRISPR-Cas cassettes, 
including mobile elements and toxin-antitoxin systems, 15,16 and 
in a study of Pyrobaculum species, evidence was presented for 
herA and nurA genes being specifically linked to cas genes. 17 Here, 
we systematically analyzed the numbers and degree of specificity 
of all genes that are associated physically with archaeal cas gene 
cassettes, and identified 12 additional accessory protein families. 

Results 

Analysis strategy 

First, we downloaded the 159 archaeal genomes available in 
May 2013 (www.ebi.ac.uk/genomes/archaea.html). The phy- 
logenetic diversity of the archaea and the number of sequenced 
genomes that were analyzed within each order are illustrated 
in the 16S rRNA-based phylogenetic tree (Fig. 1). All genomic 
loci containing cas genes were first extracted automatically 
and the proteins were then clustered using Markov clustering. 
Subsequently, the loci were annotated manually, aided by 
profile-profile searches against the Conserved Domains (CDD) and 
TIGRFAMs (TF) databases. 3,5 We then followed the proposal 
of Makarova et al. 14 pertaining to the unification of Cas protein 
families involved in interference complexes of Type I and Type III 
systems throughout the analysis. Gene cassettes encoding 
adaptation and interference modules were extracted together with 
the cas6 gene of the CRISPR RNA processing enzyme (Table 
S1, http://crispr.archaea.dk/TableS1.html). A significant 
fraction of the genetic modules were deficient or otherwise defective 
and they are also identified and tabulated (Table S1). Whereas 
93% of gene cassettes encoding Type I adaptation and interference 
complexes were found to be linked to one another and to 
CRISPR loci, for Type III systems, only 55% of the cas gene 
cassettes and CRISPR loci were contiguous, and many interference 
complexes were encoded as separate genetic units (Table 1), 
consistent with them sharing adaptation modules and CRISPR loci 
with Type I systems. 13,15 
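
The distinction between integral systems and stand-alone interference modules can be illustrated with a short sketch. The Python fragment below is an illustration only and is not the procedure used in this study: the `Feature` record, the `classify_interference` helper, the genome coordinates, and the 5 kb linkage window are all assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    kind: str   # "adaptation", "interference", or "crispr"
    start: int  # genome coordinates in bp (hypothetical)
    end: int

def gap(a: Feature, b: Feature) -> int:
    """Distance in bp between two features; 0 if they overlap."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def classify_interference(module: Feature, features: list[Feature],
                          max_gap: int = 5000) -> str:
    """Label an interference module by what lies within max_gap bp of it."""
    linked = {f.kind for f in features
              if f is not module and gap(module, f) <= max_gap}
    if {"adaptation", "crispr"} <= linked:
        return "integral"
    return "interference alone"

# Hypothetical locus: adaptation cassette, CRISPR array, two interference modules.
features = [
    Feature("adaptation", 10_000, 13_000),
    Feature("crispr", 13_500, 16_000),
    Feature("interference", 16_400, 23_000),
    Feature("interference", 410_000, 417_000),
]
for f in features:
    if f.kind == "interference":
        print(f.start, classify_interference(f, features))
```

With these invented coordinates the first interference module is labeled integral and the distant one is labeled as interference alone, which mirrors the two columns of Table 1.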

Given that many of the CRISPR systems were non-integral, 
and that there is strong evidence for exchange between 
adaptation and interference modules, 15 we analyzed 
the two functional modules separately. First, we determined the 
operon structures and gene syntenies within each genetic unit 
to define the modules. Next, we prepared separate dendrograms 
for the adaptation and interference modules. Sequences of all 
protein components were compared and the scores were added up 
for each module, yielding module-to-module distance matrices 
that were converted to dendrograms using the neighbor joining 
method (Figs. S1 and S2, http://crispr.archaea.dk/FigureS1.pdf 
and http://crispr.archaea.dk/FigureS2.pdf, respectively). 
Subsequently, a detailed analysis of the different Type I and 
Type III subtypes was performed employing a combination of 
properties of the gene cassettes, and the protein components, 
of the interference modules. 
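
The two steps described above, summing per-protein scores into a module-to-module distance and then building a tree by neighbor joining, can be sketched as follows. This is a textbook neighbor joining routine written for illustration, not the implementation used in this study; the function names, the module labels, and all distance values are hypothetical.

```python
from itertools import combinations

def module_distance(per_protein):
    """Sum per-protein distances (e.g. {"Cas7": 1.0, "Cas5": 2.0}) for one module pair."""
    return sum(per_protein.values())

def neighbor_joining(names, dist):
    """Build a Newick string from pairwise distances given as dist[(a, b)]."""
    d = {}
    for (a, b), v in dist.items():
        d[a, b] = d[b, a] = v
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[i, j] for j in nodes if j != i) for i in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        li = 0.5 * d[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i, j] - li
        new = f"({i}:{li:.3f},{j}:{lj:.3f})"
        # Distances from the new internal node to the remaining nodes.
        for k in nodes:
            if k not in (i, j):
                d[new, k] = d[k, new] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    half = d[a, b] / 2
    return f"({a}:{half:.3f},{b}:{half:.3f});"

# Invented distances for four modules; a real matrix would come from
# module_distance() applied to every pair of modules.
names = ["modA", "modB", "modC", "modD"]
dist = {("modA", "modB"): 3, ("modA", "modC"): 4, ("modA", "modD"): 6,
        ("modB", "modC"): 5, ("modB", "modD"): 7, ("modC", "modD"): 4}
print(neighbor_joining(names, dist))
```

The final join places the last internal branch at its midpoint purely for display; neighbor joining itself yields an unrooted tree.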

Table 1. Archaeal CRISPR systems. Total numbers of integral CRISPR 
immune systems and of independent interference modules 

CRISPR system    Total adaptation + interference    Interference alone
Type I           106                                11
Type III         16                                 59
Type I+III       57                                 4
Variants         11                                 20

Table 2. Cas proteins associated with archaeal Type I and Type III CRISPR 
functional modules 

Function        Type I       Type III-A    Type III-B
Adaptation      Cas1         Cas1          Cas1
                Cas2         Cas2          Cas2
                Cas4         -             Cas4
                Cas4'        -             -
Processing      Cas6         Cas6          Cas6
Interference    Cas7         Csm3+5        Cmr4+6+1
                Cas8         Csm1          Cmr2
                Cas5         Csm4          Cmr3
                Cse2/Csa5    Csm2          Cmr5
                Cas3         -             -

Our criteria for defining subtypes were as follows. A vertical 
line is drawn near the base of the dendrogram in order to separate 
optimally the already established CRISPR subtypes, 5 and this line 
is defined as the subtype threshold. Branches occurring before the 
line represent separate subtypes whereas branches initiating after 
the line belong to the same subtype. Exceptions were branches 
starting within the line. They were defined as separate subtypes 
if they showed consistent differences in gene synteny (as seen for 
subtypes I-G and V_I-2, see below), whereas if they showed similar 
gene syntenies they were inferred to represent divergent variants of 
the same subtype. These criteria can be readily applied to 
classifying newly identified CRISPR systems if a dendrogram is first 
prepared using the approach outlined above. On applying these 
criteria to the archaeal genomes, we generated a comprehensive, 
manually curated catalog of all the archaeal CRISPR systems. 
The catalog is presented in Table S1 and Figure S2 in a readily 
accessible form. It can easily be searched for individual organisms 
and genes, and researchers can both utilize and further analyze 
the data online. The results obtained were generally consistent 
with the current classification for Type I and Type III subtypes 
for bacteria and archaea, 14 but a number of archaea-specific 
properties emerged, which are summarized below. 
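
The threshold rule lends itself to a simple recursive sketch. The fragment below is a hedged illustration rather than the procedure applied in this study: the nested-tuple tree, the depth values (measured from the base of the dendrogram), and the threshold of 0.5 are all invented, and the synteny check used for branches that start within the line is deliberately left out.

```python
def leaves(node):
    """Collect the leaf labels below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def cut(node, threshold):
    """Group leaves into subtypes by cutting the dendrogram at a fixed depth.

    An internal node is (depth, left, right), where depth is the distance of
    its branch point from the base of the tree. A branch point before the
    threshold separates subtypes; one at or after it stays within a subtype.
    """
    if isinstance(node, str):
        return [[node]]
    depth, left, right = node
    if depth >= threshold:
        return [leaves(node)]
    return cut(left, threshold) + cut(right, threshold)

# Hypothetical dendrogram of five interference modules.
tree = (0.1, (0.6, "m1", "m2"), (0.3, "m3", (0.7, "m4", "m5")))
print(cut(tree, threshold=0.5))
```

A newly sequenced system would be added to the distance matrix, the dendrogram rebuilt, and the same cut applied; borderline branches would still need the manual comparison of gene syntenies described above.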
Adaptation modules 

Initially, we generated a dendrogram of all the archaeal 
adaptation modules employing a combination of concatenated 
protein sequences and gene synteny of the adaptation Cas proteins 
of Type I and Type III systems (Table 2; Fig. S1). The threshold 
for defining adaptation subtypes was then determined to 
maximize the matches to the universal CRISPR subtypes, 5 which 
were primarily Type I subtypes because most archaeal adaptation 
modules are linked genomically to Type I interference complexes 
(Table 1). Representative gene syntenies were then determined 
for each group of adaptation modules, and they are shown in 
Figure 2 superimposed on a simplified dendrogram. The 
complete adaptation dendrogram is presented in Figure S1. Many 
of the adaptation gene clusters are linked to single Type I 
subtypes but about 42% are linked to multiple subtypes (Fig. 2). 
Although the gene syntenies show significant variation for the 
different adaptation clusters, the protein sequences were highly 
conserved, and therefore, we did not use adaptation modules for 
further classification of CRISPR subtypes. However, they were 
useful for investigating the phenomenon of 
adaptation-interference modular exchange (see below). 

Figure 2. Dendrogram and gene syntenies of archaeal adaptation 
modules of different CRISPR subtypes. The total numbers of identified 
subtypes are given on the tree. The dendrogram represents a 
simplified version of the full dendrogram in Figure S1 showing the different 
classes of adaptation modules. Gene contents (indicated by bold white 
numbers), gene sizes, and gene syntenies are shown for representative 
adaptation modules. Further, the subtypes of interference modules 
associated with each class of adaptation module are shown. The threshold 
used to classify the variants is indicated by the light blue area toward the 
base of the tree, and it was selected so that the resulting classes would 
match optimally with the associated interference subtypes, which are 
mainly Type I. Nevertheless, some adaptation module classes are 
associated with interference modules from different subtypes. 
Type I subtypes 

Currently, all Type I systems of bacteria and archaea are 
classified into six subtypes I-A to I-F based on Cas1, Cas3, and Cas8 
phylogeny and on the gene syntenies. 5 Although this 
classification scheme has been largely successful, identification of newly 
sequenced CRISPR systems is not always straightforward, and 
in particular, employing Cas8 as a signature protein can be 
problematic. Often, newly sequenced CRISPR systems are not 
covered by existing Cas8 models. Furthermore, sequence matches to 
published Cas8 models are often ambiguous or misleading. For 
example, the Cas8a1_I-A model gives positive matches to Type 
I-B systems. 

Nevertheless, we could assign most archaeal Type I systems 
to current subtypes using our criteria. The results show a strong 
archaeal bias to subtypes I-A, I-B, and I-D, with very few 
examples of subtypes I-C and I-E, and no I-F subtypes were found 
(Table 3). Our sequence analyses also support the division of 
the archaeal subtype I-B into distinct subtypes, designated here 
as subtype I-B and I-G, which correspond closely to the earlier 
proposed groupings Hmar and Tneap, respectively. 3 Four variant 
Type I subtypes were identified as V_I-1, V_I-2, V_I-3, and V_I-4 that 
each occurred in low numbers and tended to be phyla-specific 
(Table 4). For each subtype, variations in gene order were 
sometimes observed but the same ORFs with similar sequences were 
still present. Typical subtype gene syntenies are illustrated on 
a simplified interference module dendrogram in Figure 3 and 
the complete archaeal interference dendrogram is presented in 
Figure S2. 

The interference module of subtype I-D is quite distantly 
related to the other Type I subtypes. In the dendrogram, it lies at 
the junction of the Type I and Type III systems (Fig. 3; Fig. S2) 
and it carries a Cas10d protein that shows low but significant 
similarity to both Cas8 and Cas10 of Type I and Type III 
systems, respectively (data not shown). Therefore, we infer that 
subtype I-D may represent an intermediate between Type I and Type 
III interference modules. 

The total numbers of each Type I subtype found associated 
with different archaeal phyla are indicated on the phylogenetic 
tree (Fig. 1) and are summarized for the main archaeal kingdoms 
in Table 3. The results show significant differences between the 
kingdoms. There is a strong bias to subtype I-A systems among 
crenarchaea, which lack most other subtypes except I-B (one 
example) and I-D (7 examples). In contrast, euryarchaea exhibit 
a more varied composition with multiple examples of subtypes 
I-A, I-B, I-G, and I-D, but with few instances of subtypes I-C 
and I-E. 
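
The kingdom bias can be read directly from the counts in Table 3. The short calculation below is a sketch for illustration; the counts are copied from Table 3, while the dictionary layout and the `shares` helper are assumptions of the example.

```python
# Type I subtype counts from Table 3: (total, crenarchaea, euryarchaea).
type_i = {
    "I-A": (69, 53, 14),
    "I-B": (26, 1, 25),
    "I-C": (2, 0, 2),
    "I-D": (18, 7, 11),
    "I-E": (5, 0, 5),
    "I-G": (27, 0, 25),
}

def shares(column):
    """Fraction of a kingdom's Type I systems per subtype (column 1 = cren, 2 = eury)."""
    total = sum(row[column] for row in type_i.values())
    return {subtype: row[column] / total for subtype, row in type_i.items()}

for subtype, frac in shares(1).items():
    print(f"crenarchaea {subtype}: {frac:.1%}")
for subtype, frac in shares(2).items():
    print(f"euryarchaea {subtype}: {frac:.1%}")
```

On these numbers subtype I-A accounts for 53 of the 61 crenarchaeal Type I systems, whereas no single subtype exceeds about a third of the euryarchaeal ones.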

Newly classified Type I subtypes 

It is clear from the interference module dendrogram (Fig. S2) 
that the original subtype I-B contains independent groups, which 
branch off well before the subtype separation threshold; hence, 
it does not constitute a homogeneous cluster. Therefore, we 
propose dividing subtype I-B into subtypes I-B and I-G (Fig. 3). 
Subtype I-G modules are also widespread among bacteria where, 
for example, they constitute the dominant Type I subtype in 
thermophilic Clostridium species. Cas8 sequences of the new I-B 
subtype match the CDD Cas8b_I-B model as predicted, 5 while 
Cas8 sequences from I-G modules match Cas8a1_I-A, 5 further 
supporting the division into subtypes I-B and I-G. Furthermore, 
the Cas8a1_I-A CDD model does not match any Type I-A Cas8 
sequences, demonstrating that the model is unsuitable for correct 
identification of Type I subtypes. 

The V_I-1 and V_I-2 variants branch off from subtypes I-A and 
I-G (Fig. 3; Fig. S2). Both variants carry interference modules 
with five genes, whereas I-A and I-G modules exhibit six and 
four genes, respectively. When compared with subtype I-G, the 
additional genes in variants V_I-1 and V_I-2 appear to have 
resulted from partitioning of cas8 into cas8' and cas8". 
Despite the similarity in gene synteny, variants V_I-1 and 
V_I-2 are distinct in their amino acid sequences, diverging 
before the subtype definition threshold (Fig. 3; Fig. S2). 
Moreover, their Cas8 sequences are not covered by any 
CDD 5 or TF 3 models. However, given that there are so 
few members and that they appear to be archaea-specific, 
V_I-1 and V_I-2 are classified as variants rather than subtypes 
(Table 4). 

V_I-3 was also reported earlier as a GSU0053-type 
module, 3 and more recently, as a variant of subtype I-C. 14 We 
were unable to confirm any link to subtype I-C and believe 
it represents an independent subtype, which should retain variant 
status until more examples are found in archaea and/or bacteria 
(Fig. 3). 

Methanocella conradii contains a Type I interference module 
designated V_I-4, which despite yielding close protein sequence 
similarity to a group of I-D systems from cyanobacteria, carries 
a Cas8 protein which does not have an HD domain, a key 
structural signature of subtype I-D systems. However, the 
cofunctional Cas3 protein does contain an HD domain and, therefore, 
V_I-4 may represent an intermediate between subtype I-D and 
other Type I subtypes (Figs. 3 and 4). 

Type III subtypes 

Earlier, five CRISPR families A to E were proposed for archaea 
based primarily on the divergent sequences of the Cas10 (Cmr2/ 
Csm1) proteins. 15,16 Later, Makarova et al. 5 defined the 
universal subtype III-A corresponding to family E and subtype III-B 
equivalent to family B. Here, all archaeal genomic copies of Type 
III systems were identified, many of which were limited to 
interference gene cassettes (Tables 1 and 2). Inspection of their 
positions within the interference dendrogram (Fig. S2) demonstrated 
that about 70% of the interference modules fell unambiguously 
into the current universal subtypes III-A or III-B, which are 
common to both crenarchaea and euryarchaea (Fig. 1). 
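
The figure of about 70% can be cross-checked against the subtype counts reported in Tables 3 and 4. The calculation below is a sketch that assumes the denominator comprises every classified Type III interference module, including the two variant subtypes; the text does not state the denominator explicitly.

```python
# Archaeal Type III interference modules per subtype (Tables 3 and 4).
counts = {
    "III-A": 32,
    "III-B": 49,
    "III-C": 8,
    "III-D": 16,
    "V_III-1": 10,
    "V_III-2": 1,
}

universal = counts["III-A"] + counts["III-B"]
total = sum(counts.values())
share = universal / total
print(f"{universal}/{total} = {share:.1%}")  # prints "81/116 = 69.8%"
```

Under that assumption the tabulated counts agree with the stated proportion.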

Here we reclassify the remaining 30% of the subtypes into 
subtypes III-C (formerly family A) and III-D (formerly 
family D); the earlier family C was integrated into subtype III-B. 16 
Ten examples of an additional subtype were observed that were 
specific to the order Sulfolobales and the subtype is classified as 
variant subtype V_III-1 (Table 4). Typical gene cassettes for each 
subtype are illustrated in the simplified dendrogram (Fig. 4), 
and the complete Type III interference dendrogram is included 
in Figure S2. As for Type I subtypes, gene orders occasionally 
differ for a given subtype but the same ORFs with closely related 
sequences are retained. The distribution of Type III subtypes 
within the different archaeal phyla is illustrated in Figure 1 and 
their total numbers and kingdom distributions are summarized 
in Table 3 where subtypes III-C and III-D show strong kingdom 
biases. 

Newly classified Type III subtypes 

Subtype III-C corresponds to the Type III systems defined 
earlier as MTH-326-like Type III 14 and archaeal family A. 15,16 Eight 
examples of this subtype were found among archaea, and a 
further 29 were identified in bacteria (data not shown). Although the 
overall gene synteny of subtype III-C is similar to that of subtype 
III-B (Fig. 4), the encoded Cas10 analog (Csx11, TIGR02682) is 
divergent, showing no sequence similarity detectable by 
conventional sequence alignment and search methods. However, when 
employing sensitive profile-profile alignments, using HHsearch, 
we obtained a match with a 98% probability score, between most 
of the N-terminal half of the protein and Csm1, the Cas10 
analog of subtype III-A. In contrast to the earlier report, 14 we found 
that the Cas5 analog of subtype III-C gives a full-length sequence 
match to Csm4, the Cas5 analog of subtype III-A and that it is 
not fused to Cas7. Finally, we were unable to detect any sequence 
similarity between protein S of subtype III-C and Csm2 or Cmr5 
of subtypes III-A and III-B, respectively. 

Table 3. Number and kingdom distribution of the major subtypes of archaeal 
Type I and Type III systems 

Type I                               Type III
subtype  number  cren  eury          subtype  number  cren  eury
A        69      53    14            A        32      11    21
B        26      1     25            B        49      39    9
C        2       0     2             C        8       0     8
D        18      7     11            D        16      14    1
E        5       0     5
G        27      0     25

cren, crenarchaea; eury, euryarchaea. 

Table 4. Number and kingdom distribution of variant archaeal 
interference subtypes 

Type I
subtype   number  cren  eury
V_I-1     5       0     5
V_I-2     2       0     2
V_I-3     2       0     2
V_I-4     1       0     1

Type III
subtype   number  cren  eury
V_III-1   10      10    0
V_III-2   1       0     1

Unclassified
V-1       9       0     9
V-2       1       0     1
V-3       1       0     1

cren, crenarchaea; eury, euryarchaea. 

Earlier, the SSO1438-type Type III module was documented 
as being distinct from subtypes III-A and III-B. 4,16 With 16 
examples in diverse archaea and at least one in bacteria (Rhodothermus), 
we propose naming this subtype III-D. The Cas10 analog is quite 
similar to those of subtypes III-A and III-B but subtype III-D is 
exceptional in carrying a higher number of RAMP genes, where 
Cas7 has diverged into four or five paralogs, in addition to the 
presence of the Cas5 analog (Fig. 4). Protein S of subtype III-D 
yields an almost full-length sequence match to the subtype III-A 
analog Csm2 using a profile-profile alignment. 

Figure 3. Dendrogram and gene syntenies of archaeal interference 
modules of different Type I subtypes where the total numbers of 
identified subtypes are given on the tree. Gene contents (indicated by bold 
numbers), gene sizes, and gene syntenies are shown for representative 
interference modules of each subtype. The dendrogram is a simplified 
version of the top half of the full dendrogram in Figure S2. The subtype 
threshold is indicated by the light blue vertical line near the base of the 
tree. Most of the identified modules fall within already established 
subtypes. The new subtype I-G was separated from the earlier subtype I-B, 
and variant subtypes V_I-1, V_I-2, V_I-3, and V_I-4 constitute exceptional subtypes 
with few members. 

An additional group of distinct Type III modules was confined 
to Sulfolobus sp. genomes and was classified as a variant V_III-1 
subtype (Fig. 4). The Cas10 analog has diverged yielding no 
detectable sequence similarity to other Cas10 proteins using 
conventional methods, although profile-profile searches reveal that 
the N-terminal half aligns significantly with Cas10 components 
of subtypes III-A and III-B (with a probability of 95%). Another 
small ORF (~100 aa) encoding a putative aspartate protease is 
present in most, but not all, V_III-1 modules. V_III-1 modules carry 
two Cas7 paralogs and a single Cas5 analog (Fig. 4). Similarity 
was detectable along the entire length of protein S and its subtype 
III-D counterpart using a profile-profile search. Another small 
ORF (~125 aa) is found in all V_III-1 modules. It is predicted to 
be mainly α-helical and yields no significant sequence matches 
in public databases. 

A further variant subtype V_III-2 was detected in 
Ferroglobus placidus that carries a Cas5 and a Cas7 analog together 
with a small protein (89 aa) and a large protein (659 aa). A similar 
module also occurs in Thermotoga lettingae, and a profile-profile 
search with the larger protein gave a full-length match to Csm1 
(99% probability). They constitute an interesting example of a 
minimal Type III system with only a single cas7 gene. 
Unclassified CRISPR systems 

Several interference modules could not be classified as either 
Type I or Type III and they were categorized separately as variant 
CRISPR systems (Table 4). Nine examples of V-1 were found 
among the Halobacteriales (Hlac_2813-2815 and homologs) 
with some haloarchaeal strains carrying multiple copies. V-1 
exhibits a small three-gene module encoding Cas5 and Cas7 
analogs and a third ORF (~300 aa), which may be a Cas8/10 analog. 

Another variant V-2 was found in Thermococcus onnurineus 
(TON_0322-0325). It constitutes a Csf-type interference mod- 
ule, also known as Type U. 14 The V-2 interference module con- 
tains Cas5, Cas7, and Cas8/10 analogs and an additional small 
protein and it is encoded adjacent to an adaptation module. 

A third variant V-3 exhibits a single protein interference 
system identified as Cas_Cpf1 on TIGRFAMs. 3 It lacks Cas3, 
Cas5, Cas7, and Cas8 and the interference function appears to 
be directed by the single protein, reminiscent of Cas9 in bacterial 
Type II systems except that Cpf1 is only half the size of Cas9 and 
the two proteins do not appear to share any structural domains. 
At least 25 examples of V-3 were also detected in bacteria (data 
not shown). 

Non-core proteins specific for CRISPR systems 

Genome context analyses revealed that non-core cas genes are 
often linked to cas gene cassettes of CRISPR-Cas systems and all 
identified examples of non-core cas-associated genes are marked 
in Table S1 together with their cognate core cas gene modules. 
These accessory genes encode a variety of proteins, which include 
Csx1, Csx3, HerA, and NurA, as reported earlier. 3,5,17 Several 
additional genes are listed in Table 5, some of which are 
specific to CRISPR-Cas systems, which include the genes encoding 
Csx1 and sRRM. Other linked proteins, including Cas_RecF, 
contain structural domains which are also encoded elsewhere but 
we were able to demonstrate a functional link to CRISPR-Cas 
systems because sequence analyses revealed that the proteins had 
coevolved significantly with their partner CRISPR-Cas system. 
Thus, Cas_RecF proteins are more closely related to each other, 
sequence-wise, than they are to other proteins with RecF domains. 
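
The argument in the preceding paragraph rests on comparing distances within a group of cas-linked proteins with distances to unlinked proteins that carry the same domain. A minimal sketch of that comparison follows; the protein labels and all distance values are invented for illustration, and the study itself does not specify the statistic that was used.

```python
from itertools import combinations, product
from statistics import mean

def within_between(dist, group, others):
    """Mean pairwise distance inside `group` and between `group` and `others`."""
    within = mean(dist[frozenset(p)] for p in combinations(group, 2))
    between = mean(dist[frozenset(p)] for p in product(group, others))
    return within, between

# Hypothetical sequence distances (0 = identical, 1 = unrelated).
raw = {
    ("casRecF_1", "casRecF_2"): 0.20,
    ("casRecF_1", "casRecF_3"): 0.30,
    ("casRecF_2", "casRecF_3"): 0.25,
    ("casRecF_1", "recF_x"): 0.80, ("casRecF_1", "recF_y"): 0.85,
    ("casRecF_2", "recF_x"): 0.75, ("casRecF_2", "recF_y"): 0.90,
    ("casRecF_3", "recF_x"): 0.80, ("casRecF_3", "recF_y"): 0.70,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

cas_linked = ["casRecF_1", "casRecF_2", "casRecF_3"]
unlinked = ["recF_x", "recF_y"]
within, between = within_between(dist, cas_linked, unlinked)
print(f"within = {within:.2f}, between = {between:.2f}")
```

A within-group mean well below the between-group mean is the pattern expected for proteins that have diverged together with their partner CRISPR-Cas system.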

There was minimal evidence for the presence of non-core cas 
genes associated specifically with Type I systems (Table S1). Only 
a few examples of two different proteins were found, one a pre- 
dicted ABC ATPase and the other containing an RRM domain 
(Table 5). Exceptionally, several proteins were found associated 
with Type I systems of the order Thermococcales and they are 
listed separately in Table S2. The genes encoding these proteins 
flank CRISPR systems in Thermococcales regardless of their 
subtype. This may reflect that the CRISPR systems are borne on 
integrated elements or specialized genomic loci where the flank- 
ing genes, although conserved, have no direct functional link with 
CRISPR activity. 

In contrast, analysis of Type III systems yielded several 
different proteins encoded with interference modules, with some of the 
genes located in operons of two or three cas genes, consistent with 
their being co-functional (Table S1). Examples of archaeal 
CRISPR-Cas gene cassettes carrying interspaced accessory 
protein genes, sometimes present in multiple copies, are 
color-coded violet in Figure 5. In total, 18 different 
accessory proteins were identified, 11 of which were specific for 
Type III systems, and seven were encoded together with 
combined Type I + Type III systems (Table 5). Many of 
the proteins are designated Csx1 and they comprise a large 
group of disparate proteins exhibiting a wide range of sizes 
and very diverse sequences but share similar N-terminal 
domains. Significantly, a Csx1 protein of S. islandicus was 
recently implicated in influencing the nucleic acid 
targeting specificity of a subtype III-B interference complex. 13 

The present work did not include an analysis of 
toxin-antitoxin gene pairs and IS elements that are commonly 
associated with archaeal CRISPR systems. 15 Nor does it 
cover proteins implicated in modulating CRISPR RNA 
transcription and processing including Cbp1 of the 
Sulfolobales 18 and Cbp2 of the Thermoproteales, 19 which, 
like RNase III that mediates RNA processing in bacterial 
Type II CRISPR systems, 20 are not linked genomically to 
CRISPR systems, and are therefore likely to perform 
additional cellular functions. 

CRISPR RNA processing 

Cas6 is the primary processing endoribonuclease for 
all archaeal CRISPR transcripts, which are generally 
transcribed as single transcripts from within the leader. 21,22 The 
cas6 gene is found either separately, or associated with gene 
cassettes for adaptation, Type I interference, Type III 
interference, or combinations thereof (Table S1). Experimental 
data suggest that the crRNA processing capability of a 
single Cas6 protein can be shared by different CRISPR 
systems in the same host. 13 Consistent with this finding, we 
observed that cas6 genes associated with Type III systems are 
not phylogenetically distinct from those associated with Type 
I systems. This is visualized on the full archaeal Cas6 
dendrogram where Cas6 proteins associated specifically with Type III 
systems are intermixed with those associated with Type I systems 
(Fig. S3, http://crispr.archaea.dk/FigureS3.pdf). 

Modular exchange 

The genomic organization of archaeal adaptation and interfer- 
ence genes as separate cas gene cassettes is consistent with their 
constituting separate functional modules. 15 Moreover, exchange of 
adaptation and interference modules between different CRISPR 
systems appears to have been widespread, 15,23 and this is exempli- 
fied for two methanoarchaea in Figure 6, where almost identical 
adaptation modules are associated with interference modules of 
three different Type I subtypes. To quantify the extent of modu- 
lar exchange, we analyzed the adaptation and interference mod- 
ule dendrograms (Figs. SI and S2). Groups of closely related 
adaptation modules were examined to establish whether their 
cognate Type I interference modules were also clustered. When 
the locations on the dendrogram were inconsistent, the groups were 
color-coded red, indicating that the adaptation modules can 
function with different interference modules and are, therefore, 
susceptible to modular exchange. The corresponding interference 
Figure 4. Dendrogram of the classified Type III subtypes. Total numbers of 
identified subtypes are given on the tree. Gene contents, gene sizes, and gene 
syntenies are shown for representative interference modules of the different 
subtypes. The dendrogram represents a simplified and inverted bottom half 
of the full dendrogram in Figure S2. Gene contents are indicated by bold numbers 
within genes, and for subtypes III-A and III-B, csm/cmr gene names are also 
given above the genes. The subtypes have distinct gene syntenies and branch 
before the defined threshold indicated by the light blue vertical line. Subtypes 
III-C and III-D are newly defined, and V_III-1 was found only in some members 
of the Sulfolobales and is therefore classified as a variant subtype. All subtypes 
have cas10, protein S, cas5, and multiple paralogs of cas7. asp denotes the 
gene of a putative aspartate protease. The subtype I-D gene cassette (in yellow) 
that branches at the junction of the Type III and Type I subtypes in Figure S2 is 
included as an outgroup. 



modules were also color-coded red. In summary, we estimate that 
about 50% of archaeal Type I systems are susceptible to exchange 
and that the phenomenon is particularly widespread for subtypes 
I-B, I-G, I-D, V_I-1, and V_I-2. 
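The consistency check described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the procedure used to produce Figures S1 and S2: it assumes that every cassette has already been given a group label on the adaptation dendrogram and a subtype label on the interference dendrogram. The cassette names, group labels, and function names are hypothetical.

```python
from collections import defaultdict

def flag_exchange_candidates(cassettes):
    """Return adaptation groups whose cognate interference modules disagree.

    cassettes: iterable of (cassette_id, adaptation_group, interference_group).
    All labels are hypothetical inputs, assumed to come from inspecting the
    adaptation and interference dendrograms.
    """
    interference_by_group = defaultdict(set)
    members_by_group = defaultdict(list)
    for cassette_id, adaptation_group, interference_group in cassettes:
        interference_by_group[adaptation_group].add(interference_group)
        members_by_group[adaptation_group].append(cassette_id)
    # A group of closely related adaptation modules is flagged (the "red"
    # groups in the text) when its members pair with more than one
    # interference group.
    return {
        group: sorted(members_by_group[group])
        for group, partners in interference_by_group.items()
        if len(partners) > 1
    }

def exchange_fraction(cassettes):
    """Fraction of cassettes that belong to a flagged adaptation group."""
    cassettes = list(cassettes)
    if not cassettes:
        return 0.0
    flagged = flag_exchange_candidates(cassettes)
    return sum(len(ids) for ids in flagged.values()) / len(cassettes)

# Hypothetical toy input: group A1 pairs with three different Type I subtypes.
example = [
    ("cassette_1", "A1", "I-A"),
    ("cassette_2", "A1", "I-D"),
    ("cassette_3", "A1", "I-G"),
    ("cassette_4", "A2", "I-B"),
    ("cassette_5", "A2", "I-B"),
    ("cassette_6", "A3", "I-A"),
]
print(flag_exchange_candidates(example))
print(exchange_fraction(example))
```

In this toy input only group A1 is flagged, because its three nearly identical adaptation modules are paired with three different interference subtypes, which mirrors the situation shown for the two methanoarchaea in Figure 6.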



Discussion 

In this study we have examined all the currently detect- 
able CRISPR adaptive immune systems in sequenced archaeal 
genomes and have generated a catalog of different Type I and 
Type III subtypes found within each archaeon (Table S1). The 
analyses have focused primarily on the properties of interfer- 
ence modules because of the widespread occurrence of modular 
exchange and the finding that adaptation modules are relatively 
highly conserved in protein composition and sequence. While 
most archaeal Type I CRISPR systems constitute integral genetic 
units, many interference modules of Type III systems are encoded 
separately. The results obtained underline the high structural 
conservation of archaeal adaptation modules and the relatively 
broad structural diversity of interference modules. 

Most archaeal Type I systems fall into a few of the universal 
subtypes defined earlier, 5 albeit using different criteria. In addition, 
we have assigned some Type III systems to other subtypes, 
III-C, III-D, and V_III-1, the former two of which (families A and 
Table 5. Accessory proteins encoded by non-core cas genes associated primarily with Type III systems 

| Name | Number | Example | Description |
|---|---|---|---|
| Type III-specific | | | |
| Csx1 proper | 65 | TTX_1228 | protein (~450 aa) matching TF Cas_DxTHG and CDD Csx1_III_U models |
| Csx1 superfamily (Csx1s) | 18 | TON_0318 | diverse group of proteins matching Csx1_III_U, including TF Cas02710. Does not include CasR and Csm6 |
| RecF-associated protein (RFas) | 16 | P186_0873 | diverse group of small proteins (~170 aa), always associated with Cas_RecF |
| Cas_RecF | 12 | P186_0874 | protein with RecF domain found near crenarchaeal Type III systems. Always associated with RFas |
| Cas ABC ATPase (ABC_ATPase) | 11 | YN1551_2137 | protein with ABC ATPase domain found near crenarchaeal Type III systems. Always associated with sRRMs |
| small RRM protein (sRRM) | 12 | YN1551_2138 | diverse group of proteins (~150 aa) containing RRM domains and always associated with Cas ABC ATPase |
| Aspartic protease (Asp_prot) | 11 | Saci_1895 | small protein (~100 aa) with retropepsin domain, associated with Type III-D and V_III-1 subtypes in Sulfolobales |
| Cmr7 | 7 | SSO1986 | protein (~200 aa) with metalloprotease domain, associated with certain variants of Type III-B subtypes |
| Csx3 | 5 | Mhar_1706 | small (~100 aa) protein matching the TF Cas_Csx3 model. Unrelated to Csx1 |
| Membrane dipeptidase (Zn2+_pep) | 4 | MJ_1673 | small (~130 aa) protein with Zn2+ dipeptidase domain, associated with some Type III-A systems in Methanococcales |
| Mvol_05XX-fam | 5 | Mvol_0536 | five proteins spanning three families associated with Type III-B systems in Methanococcales. Also found in Clostridium sp. |
| Types I + III-specific | | | |
| Cas HerA helicase (HerA) | 14 | Tagg_0815 | always associated with NurA |
| Cas NurA nuclease (NurA) | 14 | Tagg_0814 | always associated with HerA |
| Csx1 minimal (Csx1m) | 13 | Pyrfu_0517 | small (~180 aa) protein containing a minimal Csx1 domain matching TF Cas_NE0113 and CDD Csx1_III_U |
| Csx1 with PIN toxin (Csx1p) | 8 | Metig_1253 | ~400 aa protein consisting of a minimal Csx1 domain fused to a PIN toxin domain |
| Tneu_1160-fam | 3 | Tneu_1160 | large protein of unknown function, associated with CRISPR systems of Thermoproteales |
| SbcC repair ATPase (SbcC) | 3 | CSUB_C0986 | large protein with SbcC domain found in cas gene cassettes of diverse archaea |
| SbcD nuclease (SbcD) | 3 | CSUB_C0987 | protein with SbcD domain, associated with SbcC |
| Type I-specific | | | |
| Cas ABC ATPase (ABC_ATPase) | 5 | TTX_1256 | specific for Type I systems but related to corresponding protein for Type III systems |
| small RRM protein (sRRM) | 4 | TTX_1257 | specific for Type I systems but related to corresponding protein for Type III systems |
| Thermococcales specific (see Table S2) | 53 | PF1132-1135 | 11 protein families found in Thermococcales adjacent to Type I systems regardless of subtype |

Most of the accessory proteins are specific to CRISPR systems and are not normally found encoded outside of cas gene cassettes. In addition to 
the most widespread Csx1-type accessory proteins, numerous accessory proteins have helicase and nuclease domains, which tend to be found 
associated with DNA replication, recombination, and repair. Protease domains, toxin domains, and RRM domains are also found in many proteins, 
while some larger proteins have non-identifiable domains. 



D) were described earlier for archaea. 16 In their more general classification, 
Makarova et al. 14 employed the signature protein Cas10/Cmr2/Csm1 
for classifying Type III systems, but this protein is 
extremely divergent in subtypes III-C, V_III-1, and V_III-2, and we 
suspect, therefore, that some Type III systems remain undetected. 



We successfully used the Pfam RAMP model (PF03787) as a sig- 
nature for Type III systems because this model is quite specific 
for Type III Cas7 proteins. Therefore, we predict that employing 
PF03787 as a universal signature for Type III systems will also lead 
to the discovery of new Type III subtypes in bacteria. 
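As an illustration of how such a signature could be applied in practice, the sketch below filters a tabular hit list for proteins that match PF03787. It is a hedged example rather than the pipeline used in this study: it assumes the hits come from HMMER's hmmsearch run with the --tblout option against a Pfam HMM file (so that the query accession column carries the Pfam accession), and the E-value cutoff, the protein identifiers, and the function name are hypothetical.

```python
def ramp_signature_hits(tblout_lines, accession_prefix="PF03787", max_evalue=1e-5):
    """Collect protein ids with a significant hit to the RAMP model.

    tblout_lines: lines of an hmmsearch --tblout file (assumed format:
    target name, target accession, query name, query accession,
    full-sequence E-value, ...). The E-value cutoff is an assumption.
    """
    hits = set()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(maxsplit=18)
        target_name = fields[0]
        query_accession = fields[3]
        full_evalue = float(fields[4])
        if query_accession.startswith(accession_prefix) and full_evalue <= max_evalue:
            hits.add(target_name)
    return sorted(hits)

# Hypothetical rows, written only to show the column layout.
SAMPLE = [
    "# target name accession query name accession E-value score bias ...",
    "protA - RAMPs PF03787.20 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
    "protB - RAMPs PF03787.20 0.5 8.1 0.0 0.7 7.6 0.0 1.1 1 0 0 1 1 0 0 hypothetical protein",
    "protC - OtherModel PF00000.1 3e-40 130.0 0.2 4e-40 129.5 0.2 1.0 1 0 0 1 1 1 1 unrelated protein",
]
print(ramp_signature_hits(SAMPLE))
```

Only protA passes in this toy input: protB matches the model but above the assumed E-value cutoff, and protC matches a different (placeholder) model.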
Figure 5. Examples of Type III-A, III-B, and III-D subtypes in archaea from diverse phyla that are associated with multiple and diverse accessory protein 
genes. Type III gene cassettes are color-coded red and cmr and csm gene labels are given. Type I-A gene cassettes are color-coded for adaptation (light 
blue) and interference (yellow). Some of the subtype III-B systems are directly linked to subtype I-A systems. All core cas genes are identified by bold 
numbers. Non-core cas-associated genes are color-coded violet and are labeled by names above the genes (see also Table 5). Genes for putative regulatory 
proteins (green) and Cas6 (orange) are also indicated together with CRISPR loci (dark blue) where the number indicates the number of repeat-spacer 
units. In some Type III gene cassettes, the number of accessory protein genes exceeds the number of core cas genes. 



In late 2011, Makarova et al. 14 proposed unifying protein fami- 
lies of the different CRISPR types, Type I Cas and Type III Cmr 
and Csm, basically inferring that all Type I interference systems 
were made up of Cas3, Cas5, Cas7, and Cas8, and that Type III 
interference complexes contained Cas5, Cas7, Cas10, and the small 
protein component S. This proposal has received strong inde- 
pendent support from a number of recent structural studies. For 
example, the subtype I-E and subtype III-B interference complexes 
exhibit some common structural features 7,24 that are illustrated 
schematically in Figure 7. Type I interference modules generally 
encode a Cas3 helicase, Cas5 RAMP, Cas7 RAMP, and Cas8, and 
in the subtype I-E complex, Cas5 is embedded in Cas8 to form 
the base of the complex from which a backbone of multiple Cas7 
subunits extends (Fig. 7A). For some subtypes, Cas8 is split into 
Cas8' and Cas8", and Cas3 can also be partitioned. The interference 
complex of subtype III-B does not carry a Cas3 helicase but it 
exhibits analogous structural features, where Cas8' is replaced by 
Cas10 and Cas8" is exchanged with protein S (Fig. 7B). Moreover, 
Cas5 is also embedded in Cas10 and the shared Cas7 backbone 
interfaces with other parts of the complex through divergent Cas7 
paralogs, the numbers of which differ between different Type III 
subtypes (Fig. 7B). Although sequence similarities between many 
of the predicted homologs 14 were very difficult to detect by 
conventional sequence comparison methods, the profile-profile 
approach employed here enabled us to detect and confirm 
the predicted homologies. The resulting reduction in the number 



of core Cas protein families considerably facilitated Cas protein 
identification and annotation of cas genes. It is likely that some 
additional Cas proteins described in a recent metagenomic study 
of bacterial CRISPR systems 25 could have been assigned within 
this unified Cas protein classification if more sensitive sequence 
comparative methods had been employed. 

Extensive evidence supporting the periodic exchange of adap- 
tation and interference modules between different archaeal Type 
I systems is summarized in Figures S1 and S2, and the data are 
consistent with earlier evidence presented for archaeal CRISPR 
modular exchange. 15,23 Presumably this widespread phenomenon 
reflects that the structure and function of the adaptation 
modules are highly conserved. This property also undermines 
using the phylogeny of the conserved adaptation protein Cas1 for 
determining CRISPR subtypes. However, the sole employment 
of interference modules for classifying archaeal CRISPR systems 
also carries the reservation that they may function with different 
types of adaptation modules. 

We still have limited insights into the detailed mechanistic 
and functional diversity of the interference modules of the different 
Type I and Type III subtypes. 9,10,12 Closely related S. islandicus 
strains were shown earlier to carry multiple and different Type III 
subtypes, 26 which increases the likelihood that their interference 
mechanisms are functionally diverse. Furthermore, the recent 
broad unification of core Cas protein families has provided a 
basis for discerning non-core proteins that are co-functional with 
Figure 6. A putative example of modular exchange of Type I adaptation and interference modules. Gene cassettes for adaptation (coded blue) and 
interference (coded yellow) for Methanocaldococcus vulcanius and Methanocaldococcus fervens genomes are depicted showing the adaptation modules 
with highly similar sequences (93% concatenated amino acid sequence identity) that are associated with three different subtypes (I-A, I-D, and I-G) of 
Type I interference modules. Gene contents are indicated by bold numbers within genes. R (green) denotes genes of putative transcriptional regulators 
and 6 (orange) indicates genes of the Cas6 RNA processing enzyme. 




Figure 7. Schematic comparison of (A) a subtype I-E interference complex 
from E. coli 24 with (B) a subtype III-B interference complex from 
Pyrococcus furiosus. 7 The two structures share homologous protein 
components consistent with the related compositions of their gene cassettes. 
In (A), the Cas6 protein is considered to be part of the interference 
complex and Cas3 is essential for the interference mechanism. In 
(B), Cas6 has not been shown to be part of the interference complex. 
Moreover, the two Cas7 proteins in green (Cmr1) and orange (Cmr6) (see 
Fig. 4) are Cas7 paralogs. The estimated binding sites of the crRNAs are 
color-coded brown. 



CRISPR systems. Several accessory protein families were identi- 
fied in this study that are exclusively or primarily encoded within, 
or adjacent to, a limited number of archaeal Type III interference 
gene cassettes. Two of these, HerA and NurA, were identified ear- 
lier for CRISPR systems of different Pyrobaculum species. 17 It is 
likely that some of the proteins, which include putative proteases 
and ATPases, can modify or extend functions of the associated 
CRISPR systems. This supposition recently received experimen- 
tal support from one of two Type III-B systems in S. islandicus 
REY15A that was demonstrated to share a CRISPR spacer, and a 
Cas6 RNA processing enzyme, with a Type I-A system, but was 
functionally dependent on an accessory Csx1 protein. 13 

Proteins annotated as Csx1 do not comprise a single protein 
family, but rather a number of diverse protein families exhibit- 
ing different sizes and variable domain architectures. What they 
share is an N-terminal domain of approximately 150 amino acids 
in length containing a DxTHG motif or variants thereof. This 



domain, which defines what can be called the Csx1 superfamily 
of proteins, is also found in the cas transcriptional regulators 
CasR and Csm6 where it is fused to a C-terminal HTH domain. 
While it is not clear why such a conserved domain should be 
fused to a wide range of different proteins, the conserved domain 
could provide a site for interfacing with Type III and Type I inter- 
ference modules to modify their activities. 
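The shared feature described above, a DxTHG motif within an N-terminal domain of roughly 150 residues, can be illustrated with a simple pattern search. This is a deliberately naive sketch and not the profile-based detection used in this study: it matches only the strict motif (D, any residue, T, H, G) and none of the variants mentioned in the text, and the 150-residue window, the function name, and the toy sequences are assumptions made for illustration.

```python
import re

# Strict DxTHG pattern: "." stands for any single residue.
DXTHG = re.compile(r"D.THG")

def find_dxthg(protein_seq, n_terminal_length=150):
    """Return 1-based start positions of DxTHG within the N-terminal window.

    The window length (150 aa) follows the approximate domain size given
    in the text; it is an assumed cutoff, not a defined boundary.
    """
    region = protein_seq[:n_terminal_length].upper()
    return [match.start() + 1 for match in DXTHG.finditer(region)]

# Hypothetical toy sequences, not real proteins.
with_motif = "M" + "A" * 40 + "DKTHG" + "A" * 200
motif_too_far = "A" * 160 + "DRTHG"

print(find_dxthg(with_motif))
print(find_dxthg(motif_too_far))
```

A plain text match of this kind will miss divergent family members, which is why sensitive profile-based comparisons are needed for proteins as variable as those of the Csx1 superfamily.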

CRISPR-based immune systems probably originated in early 
primitive cellular structures where exchange of genetic material 
was likely to have been common. 27,28 Moreover, their widespread 
presence in most archaea and many bacteria suggests that they 
predated the inferred branching of the bacterial domain, although 
the bacterial Type II system is likely to have evolved into its pres- 
ent form later. We have discussed earlier the evidence supporting 
inter-genomic exchange of CRISPR systems between archaea. 23 
This was considered to be facilitated by the often integral nature 
of the CRISPR-cas gene cassettes, which are often located in variable 
genomic regions and are sometimes bordered by transposable 
elements. 16 Nevertheless, the relative similarity of protein 
components of a few archaeal and bacterial subtypes, combined 
with their biased distributions, suggests that some inter-domain 
exchange of CRISPR systems has occurred, despite the presence 
of formidable genetic barriers. 23 For example, the relatively com- 
mon bacterial subtypes I-C and I-E occur rarely among archaea, 
and no examples of bacterial subtype I-F were found (Fig. 1), 
suggesting that rare transfers from bacteria to archaea may have 
occurred, and primarily among the euryarchaea. 

Crenarchaea, which tend to occupy archaea-rich thermophilic 
or acidothermophilic environments, are relatively homogeneous 
in their CRISPR systems. They exhibit predominantly subtypes 
I-A (90%) and III-B (53%), which suggests that they have under- 
gone very few, if any, transfers of CRISPR systems from other 
archaeal kingdoms or from bacteria. In contrast, the euryarchaea 
carry several examples of subtypes I-A, I-B, I-D, and I-G, and 
subtype III-A (52%), which constitutes the most common Type 
III system. Euryarchaea frequent a wide range of natural environ- 
ments and are more diverse phylogenetically than crenarchaea. 
Moreover, many of their natural habitats are relatively rich in 
bacteria, possibly rendering their CRISPR systems more suscep- 
tible to exchange between archaea and bacteria. 
Our aim in this work was to complete a comprehensive analysis 
of the archaeal CRISPR systems. Here we provide an interactive 
system, in Table S1, and in Figure S2, which can be used for 
future analysis and characterization of these systems, including 
further examination of the different variant systems as well as the 
possible functional roles of the non-core Cas proteins. 

Materials and Methods 

Bioinformatical analyses 

Genomic loci encoding archaeal adaptation, Type I interfer- 
ence, and Type III interference modules were first identified in 
the set of 159 complete archaeal genomes by searching for cas1 
and cas7 genes using custom HMMs 29 and the Pfam 30 RAMPs 
model. The identified cas1 and cas7 genes and the 10 genes flanking 
them on either side were pooled, and the whole pool was 
subjected to an all-against-all pairwise sequence alignment comparison, 31 
which was used as an input for Markov clustering 32 
using custom similarity measures. Each protein family resulting 
from the Markov clustering was searched against the protein 
family databases CDD, 5 COG, 4 TIGRFAMs, 3 and Pfam 30 using 
profile-profile alignments with HHsearch. 33 Each genomic 
locus containing the genes was inspected manually and, using 
the information from the profile-profile comparison as well 
as conventional sequence searches against public databases, cas 
modules and cas gene families were defined. cas cassettes were 
also defined as consisting of collections of modules (Table S1). 
Adaptation and interference module dendrograms (Figs. S1 and 



S2) were created by comparing all protein components of each 
type of module against corresponding modules from other cas- 
settes. By keeping track of the modules corresponding to each 
protein component, module-to-module similarity scores were 
calculated by adding up scores for all constituent proteins. Scores 
between all modules were inverted and normalized to create a 
module-to-module distance matrix. The distance matrix was 
used as input for constructing a neighbor-joining tree, 34 which 
was subsequently mid-point rooted. Variant systems that could 
not be classified as either Type I or Type III were not included in 
the interference dendrogram. 
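The scoring scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used for Figures S1 and S2. The text does not specify how the summed scores were inverted and normalized, so the sketch assumes one simple choice: dividing by the largest module-to-module score and taking one minus that value. It also assumes that each protein pair is listed once. The protein ids, module ids, and function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def module_distance_matrix(protein_hits, protein_to_module):
    """Build a module-to-module distance matrix from pairwise protein scores.

    protein_hits: iterable of (protein_a, protein_b, score), each pair
        assumed to be listed once.
    protein_to_module: dict mapping each protein id to its module id.
    Returns (sorted module ids, nested dict of distances).
    """
    modules = sorted(set(protein_to_module.values()))
    similarity = defaultdict(float)
    for protein_a, protein_b, score in protein_hits:
        module_a = protein_to_module[protein_a]
        module_b = protein_to_module[protein_b]
        if module_a != module_b:
            # Add up the scores of all constituent protein pairs.
            similarity[frozenset((module_a, module_b))] += score

    max_similarity = max(similarity.values(), default=0.0)
    distance = {m: {n: 0.0 for n in modules} for m in modules}
    for m, n in combinations(modules, 2):
        s = similarity.get(frozenset((m, n)), 0.0)
        # Assumed inversion and normalization: d = 1 - s / max(s).
        d = 1.0 - s / max_similarity if max_similarity > 0 else 1.0
        distance[m][n] = distance[n][m] = d
    return modules, distance

# Hypothetical toy input: three modules with five proteins in total.
protein_to_module = {"p1": "M1", "p2": "M1", "q1": "M2", "q2": "M2", "r1": "M3"}
hits = [("p1", "q1", 200.0), ("p2", "q2", 100.0), ("p1", "r1", 60.0)]
modules, distance = module_distance_matrix(hits, protein_to_module)
print(modules)
print(distance["M1"]["M2"], distance["M1"]["M3"], distance["M2"]["M3"])
```

Under this assumed scaling the most similar pair of modules receives a distance of 0 and pairs with no detectable similarity receive 1. The resulting matrix could then be passed to any neighbor-joining implementation and the tree mid-point rooted, as described in the text.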